Journal of Bioinformatics and Systems Biology — Latest Matching Preprints

1

Binary search and set operations on compacted k-mer lists

Dufresne, Y.; Andreace, F.

2026-07-03 bioinformatics 10.64898/2026.06.29.735436 medRxiv

Top 0.1%

1.6%

Sorted lists of elements are particularly good for computing set operations. A single scan of the two lists is sufficient to materialize or count the results of the union, intersection, difference, and xor operators. In bioinformatics, only a few tools are designed to perform these operations on k-mers. A fast tool like KMC allows set operations at the cost of storing individual k-mers. In this paper, we introduce a novel way to represent sorted k-mers as a collection of recomposed super-k-mer sorted lists. We introduce the concept of virtual super-k-mer and show how to construct, query and perform set operations on sorted lists of virtual super-k-mers. In the implementation sklib, we demonstrate high throughput of the data structure for construction and set operations, while remaining competitive in query capabilities, within a controlled memory footprint (2-5x decrease in bits/element compared to KMC).
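
The single-scan idea in the first two sentences can be illustrated with a minimal sketch. It works on plain sorted lists of k-mer strings and is not the sklib data structure or the virtual super-k-mer representation; the function name `merge_scan` and the toy k-mers are assumptions made for illustration only.

```python
def merge_scan(a, b):
    """Scan two sorted, duplicate-free lists once.

    Returns (union, intersection, difference a-b, xor) as sorted lists.
    Hypothetical helper for illustration; not part of sklib or KMC.
    """
    union, inter, diff, xor = [], [], [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Element present in both lists.
            union.append(a[i])
            inter.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Element only in a.
            union.append(a[i])
            diff.append(a[i])
            xor.append(a[i])
            i += 1
        else:
            # Element only in b.
            union.append(b[j])
            xor.append(b[j])
            j += 1
    # Whatever remains in one list is absent from the other.
    for x in a[i:]:
        union.append(x)
        diff.append(x)
        xor.append(x)
    for y in b[j:]:
        union.append(y)
        xor.append(y)
    return union, inter, diff, xor


# Toy 3-mers, made up for the example.
a = ["ACG", "CGT", "GTA", "TAC"]
b = ["CGT", "GGG", "TAC", "TTT"]
union, inter, diff, xor = merge_scan(a, b)
print(inter)  # ['CGT', 'TAC']
```

Each element of either list is visited exactly once, so the cost is linear in the combined length, which is why sorted input makes all four operators cheap to materialize or count.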

2

Semi-automated reconstruction of glomerular architecture from 3D confocal microscopy data

Loyd, Y. M.; Chase, S. E.; Krendel, M.

2026-07-10 cell biology 10.64898/2026.07.03.736410 medRxiv

Top 0.2%

1.0%

Nephrons are the functional units of the kidney; within each nephron, the glomerulus is the initial site of selective filtration that allows removal of waste products while preserving proteins in the bloodstream. Each glomerulus consists of a network of capillaries surrounded by specialized epithelial cells, podocytes, which mediate selective filtration. Abnormalities in glomerular structure impair renal function, resulting in proteinuria and kidney disease. Although several microscopy-based approaches exist to characterize glomerular architecture and structural abnormalities, quantitative analysis is often limited by labor-intensive image segmentation. In this study we present a semi-automated approach for segmentation and analysis of glomerular architecture from three-dimensional confocal microscopy data. Using mTmG transgenic mice that express membrane-associated EGFP in podocytes and membrane-associated tdTomato across all other cell types, we reconstruct podocyte processes and glomerular capillaries from volumetric renal images. This semi-automated approach reduces manual segmentation effort and supports more efficient, standardized analysis of glomerular architecture in three-dimensional confocal microscopy datasets.

3

Drivers of Diagnostic Variation in a Digital Global Kidney Transplant Reader Study

Hofstraat-Boersma, R.; du Long, R.; Buzzanca, G.; Abiola, A. A.; Albadri, S.; Ali, Z.; Altaleb, A.; Angioi, A.; Banu, S. G.; Barry, M.; Bhalodia, A. R.; Bianco, P.; Broecker, V.; Buelow, R.; Chauveau, B.; Chen, G.; Cheunsuchon, B.; Crisi, G. M.; Daneshvar, S.; Dendooven, A.; Dokouhaki, P.; Drachenberg, C. B.; Farris, A. B.; Ferlicot, S.; Florquin, S.; Fontana, F.; Gibier, J.-B.; Gibson, I. W.; Gujarathi, S.; Hendricks, A. R.; Husain, S.; Islam, J.; Ismail, W.; Jagannathan, G.; Klager, J.; Kozakowski, N.; Krizova, A.; Kurien, A. A.; Kwon, B.; L'Imperio, V.; Ledesma, F. L.; Low, J. P.; Martin, J

2026-07-13 pathology 10.64898/2026.07.09.26357318 medRxiv

Top 0.4%

0.6%

Background: Diagnostic interpretation of kidney allograft biopsies using the Banff classification remains variable, but the determinants of this variability are not fully defined. We performed a global, fully digital multi-reader study to identify the principal drivers of disagreement in Banff-based assessment. Methods: Thirty-six kidney transplant biopsies were independently scored by 67 renal pathologists on a standardized digital platform. Readers assessed Banff lesions on hematoxylin and eosin, periodic acid Schiff, and Jones' silver stains; final diagnostic categories were assigned using prespecified Banff-based decision rules. Interobserver agreement was quantified with Gwet's agreement coefficient (AC) statistics. Determinants of diagnostic agreement were evaluated using pairwise mixed-effects logistic regression, and reader similarity was examined by principal component analysis (PCA) with post hoc molecular annotation. Results: Agreement for final diagnostic categories was moderate (Gwet's AC1, 0.55; 95% CI, 0.47-0.63). Lesion-level agreement varied substantially, with lowest agreement for selected threshold-dependent inflammatory or semi-quantitative lesions, including interstitial inflammation in areas of IFTA, peritubular capillaritis, and arteriolar hyalinosis. Diagnostic concordance differed markedly across biopsies, indicating strong case-level heterogeneity. In pairwise models, differences in active inflammatory and vascular lesion scoring were the strongest correlates of diagnostic disagreement; reader experience and geography contributed minimally. PCA showed that reader variation was organized along two dominant axes: a rejection-calling threshold axis linked mainly to tubulointerstitial inflammatory injury, and a T cell-mediated (TCMR/TI) and antibody-mediated/microvascular (AMR/MVI) inflammation-oriented phenotypic classification axis. Conclusion: Interobserver variation in Banff-based kidney transplant biopsy assessment is structured rather than random and is driven mainly by how readers threshold and integrate key inflammatory lesion compartments rather than by experience or geographic location.
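
For readers unfamiliar with the agreement statistic named above, the following is a minimal sketch of the standard multi-rater, unweighted Gwet's AC1 formula. It is not the study's analysis code; the function name `gwet_ac1`, the two-category labels, and the toy ratings are assumptions for illustration, and every subject is assumed to have at least one rating.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Unweighted Gwet's AC1 for several raters (illustrative sketch).

    ratings: one list per subject, holding the category assigned by each rater.
    categories: all possible category labels.
    """
    q = len(categories)
    n = len(ratings)
    pi = {k: 0.0 for k in categories}  # average share of ratings per category
    pa_terms = []                      # per-subject pairwise agreement
    for subject in ratings:
        r = len(subject)
        counts = Counter(subject)
        for k in categories:
            pi[k] += counts[k] / r / n
        if r >= 2:  # pairwise agreement needs at least two raters
            agree = sum(c * (c - 1) for c in counts.values())
            pa_terms.append(agree / (r * (r - 1)))
    pa = sum(pa_terms) / len(pa_terms)                       # observed agreement
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)     # chance agreement
    return (pa - pe) / (1 - pe)


# Made-up example: 4 biopsies, 3 readers, two diagnostic categories.
toy = [["R", "R", "R"], ["R", "R", "N"], ["N", "N", "N"], ["N", "N", "R"]]
ac1 = gwet_ac1(toy, ["R", "N"])
print(round(ac1, 3))  # 0.333
```

In this toy case the observed agreement is 2/3 and the chance agreement is 1/2, giving AC1 = 1/3.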

4

Ambient AI Documentation in Clinical Genetics: Perspectives on Implementation and Impact on Burnout

Narain, A.; Misurac, J.; Van Tiem, J.; LaSpisa, C.; Campbell, C. A.

2026-07-02 genetic and genomic medicine 10.64898/2026.06.30.26356723 medRxiv

Top 0.6%

0.5%

Objectives: To assess genetic counselors' perspectives on ambient AI adoption and its impact on counselor burnout. Materials and Methods: We used a mixed-methods approach, surveying burnout using the validated Stanford Professional Fulfillment Index (PFI) before and after ambient AI adoption and exploring adoption perspectives through semi-structured interviews. Results: 64% of participants (16/25) completed the pre-survey, with eleven completing post-surveys (69% response rate for completion of all three surveys). Fourteen of 25 participants completed interviews. Ambient AI use was associated with a reduction in burnout after 90 days; respondents who reported using ambient AI (vs. non-use) had burnout scores 1.05 points lower, on average (p=0.008). Benefits of adoption included effective use with interpreters, memory aid, summarization of non-templated note sections (e.g., family/social history), and improved patient engagement. Challenges included template customization, variable accuracy, oversimplified medical language, and rapport disruption during consent. Ethical and regulatory considerations included data privacy, bias, awareness of training resources, and concerns about job displacement. Discussion: Ambient AI documentation can reduce documentation burden and burnout among genetic counselors. By evaluating both outcomes and real-world implementation considerations, our study provides evidence to guide scalable integration of AI-enabled documentation tools in clinical genomic medicine. Conclusion: Ambient AI can help support the sustainability of the clinical genetics workforce as genomic medicine initiatives are scaled across health systems. Addressing genetics-specific documentation needs while prioritizing patient trust, transparency, and provider oversight is essential for responsible ambient AI implementation.

5

Voclosporin Preserves Mitochondrial Function Compared With Cyclosporine A in Perfused Human Proximal Tubule Microphysiological Systems

Aryeh, K. S.; Tsang, Y. P.; Hsu, E. W.; Yeung, C. K.; MacDonald, J.; Bammler, T. K.; Himmelfarb, J.; Rehaume, L. M.; Kelly, E. J.

2026-07-11 pharmacology and toxicology 10.64898/2026.07.07.737071 medRxiv

Top 0.6%

0.5%

Key Points: Perfused human kidney MPS revealed CsA-associated sublethal tubular stress that was not detected by conventional 2D viability assays or by KIM-1 release in 3D MPS. At matched exposure, VCS preserved mitochondria and activated ER chaperones and iron detoxification, with no p21 arrest compared to CsA. Mechanistic separation supports VCS's nephroprotection potential and early mechanism-based biomarkers to guide CNI choice. (Graphical abstract: Figure 1.) Background: Calcineurin inhibitors (CNIs) are indispensable for transplantation immunosuppression, yet cyclosporine A (CsA) produces renal toxicity. Voclosporin (VCS), a CsA analog, is proposed to be less nephrotoxic, but mechanisms remain unclear. Methods: Primary human proximal tubule epithelial cells (PTECs) were exposed to CsA or VCS in 2D monolayers and perfused 3D kidney microphysiological system (MPS). Viability was assessed in 2D cultures by MTS, mitochondrial membrane potential (ΔΨm) by TMRM flow cytometry, and soluble injury and inflammatory biomarkers in MPS effluents by ELISA and MSD multiplex assays. RNA sequencing of 3D-cultured PTECs was used to identify differentially expressed genes and pathways. Results: In 2D PTECs, neither drug reduced viability. In 3D MPS effluents, KIM-1 did not distinguish CsA from VCS, whereas the MSD biomarker panel showed larger aggregate deviation with CsA. Confocal tomography showed CsA-associated mitochondrial fragmentation, whereas VCS preserved reticular mitochondrial architecture. TMRM flow cytometry showed a treatment-dependent difference in TMRM-positive cells, with VCS yielding the highest TMRM-positive fraction and exceeding CsA, supporting preservation of ΔΨm relative to CsA. RNA-seq identified 1188 CsA-specific and 185 VCS-specific differentially expressed genes, with 304 shared. Pathway analysis indicated CsA enrichment of unfolded protein response (UPR) and endoplasmic reticulum (ER) stress, p21-associated G2/M checkpoint arrest, and transcriptional signatures consistent with ferroptosis priming, while VCS mainly induced ER chaperone and ER-associated degradation gene programs without activating canonical UPR sensors and showed limited cell-cycle suppression. Conclusions: A physiologically relevant 3D kidney MPS revealed sublethal tubular stress from CsA that is masked in 2D culture, including mitochondrial depolarization, proteostatic stress, and ferroptosis priming. At matched exposure, VCS preserved mitochondrial function and proteostasis while eliciting a narrower, adaptive ER quality control response. These data support VCS as a nephron-sparing immunosuppressant and 3D MPS as a mechanism-based platform for evaluating renal safety of drugs and nominating early sub-lethal tubular injury biomarkers.

6

Bamsnap-LRS: an automated batch visualization tool for long-read sequencing alignments

Chen, W.; Yang, C.; Qiu, L.; Hu, J.; Zhou, Y.

2026-06-25 bioinformatics 10.64898/2026.06.21.733121 medRxiv

Top 0.6%

0.5%

Summary: Long-read sequencing (LRS) has become essential for genome assembly, structural variation (SV) detection, haplotype phasing, and transcript isoform characterization. However, these applications often require manual inspection of read alignments for validation. Existing visualization tools are either interactive genome browsers that are difficult to scale to large datasets or batch-oriented tools that are not optimized for the unique alignment patterns of long-read data. We developed Bamsnap-LRS, an automated command-line tool for high-throughput LRS alignment visualization. It supports long-read-specific features, phased SNP inspection, and publication-ready batch figure generation within a unified framework for genomic, transcriptomic, and haplotype-aware analyses. Availability and Implementation: All code and examples are freely available at https://github.com/comery/Bamsnap-LRS.

7

Genomic Annotation Infrastructure (GAIn): Pipelines and Resource Repositories for Annotating Variants, Positions, and Regions

Cokol, M.; Chorbadjiev, L.; Lee, Y.-h.; Jamsandekar, M.; Gergova, I.; Todorov, I.; Iossifov, I.

2026-07-12 bioinformatics 10.64898/2026.07.08.737273 medRxiv

Top 0.7%

0.4%

Interpretation of genomic variants, positions, and regions depends on reliable annotation (adding evidence such as predicted effect, conservation, population frequency, and gene-level context), yet the underlying resources are numerous, versioned, and assembly-specific. We present the Genomic Annotation Infrastructure (GAIn), a platform that generates transparent, reproducible annotations via declarative pipelines that define annotation tasks as ordered lists of components, called annotators, that produce annotation attributes using genomic resources from Genomic Resource Repositories (GRRs). We provide two public GRRs: a main repository containing more than 250 heterogeneous genomic resources, and a separate GRR-ENCODE repository containing resources derived from thousands of ENCODE (Encyclopedia of DNA Elements) project experiments. Users can use the annotation pipelines we made available, author custom annotation pipelines, and execute annotation tasks with these pipelines via GAIn's web and command-line interfaces. The web interface can be used without any setup, but it relies on shared computational infrastructure and imposes limits on the size of annotation tasks. The command-line interface requires setup but supports arbitrarily large annotation tasks through simple-to-use parallelization and offers a broader set of features. For example, command-line GAIn can be extended by using custom GRRs or creating custom annotators via its plugin architecture. In addition, GAIn's re-annotation feature, which updates annotations as they evolve, substantially simplifies maintaining annotations in a large genomics analysis project. GAIn's resource management, explicit versioning, and pipeline abstraction provide an auditable, maintainable, and efficient foundation for modern genomic annotation across reference assemblies and use cases.
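
The notion of a pipeline as an ordered list of annotators, each producing attributes from a resource, can be pictured with a small sketch. This is not GAIn's API or configuration format: the function names, the in-memory dictionaries standing in for genomic resources, and the attribute names are all hypothetical.

```python
# Hypothetical in-memory "resources"; real GRR resources are versioned files.
GENE_BY_POSITION = {("chr1", 100): "GENE_A", ("chr2", 250): "GENE_B"}
ALLELE_FREQUENCY = {("chr1", 100, "A", "G"): 0.012}


def gene_annotator(record):
    """Toy annotator: adds a gene-level attribute looked up by position."""
    key = (record["chrom"], record["pos"])
    return {"gene": GENE_BY_POSITION.get(key)}


def frequency_annotator(record):
    """Toy annotator: adds a population-frequency attribute for the variant."""
    key = (record["chrom"], record["pos"], record["ref"], record["alt"])
    return {"allele_freq": ALLELE_FREQUENCY.get(key)}


def run_pipeline(pipeline, variant):
    """Apply annotators in order; later annotators can see earlier attributes."""
    record = dict(variant)
    for annotator in pipeline:
        record.update(annotator(record))
    return record


# The pipeline itself is just an ordered list of annotators.
pipeline = [gene_annotator, frequency_annotator]
variant = {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"}
annotated = run_pipeline(pipeline, variant)
print(annotated["gene"], annotated["allele_freq"])  # GENE_A 0.012
```

Keeping the pipeline as data (an ordered list) rather than hard-coded calls is what makes it easy to rerun the same steps when a resource version changes.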

8

GBZ-base and GAF-base: Indexed pangenome file formats

Siren, J.; Paten, B.; the Human Pangenome Reference Consortium

2026-07-11 bioinformatics 10.64898/2026.07.10.737775 medRxiv

Top 0.9%

0.4%

Motivation: Existing pangenome file formats are designed for batch processing. Graphs must be loaded into memory, and alignment files must be read sequentially. Indexed file formats that can be used directly from disk would be more appropriate for interactive applications. Results: We propose GBZ-base and GAF-base, SQLite-backed file formats comparable to GBZ and GAF. GBZ-base supports efficient extraction of local subgraphs, and GAF-base lets us extract all alignments to the subgraph. Additionally, GAF-base is smaller than any other file format for sequence-to-graph alignments. Availability and implementation: Available from https://github.com/jltsiren/gbz-base and https://crates.io/crates/gbz-base under the MIT license.

9

A blinded, counterbalanced rater design for evaluating AI-assisted summarisation of tertiary clinical genomics reports: methodology of the QNOMX-VHIR-CPSP-001 Phase 1 study

Creeden, J.; Olivecrona, M.; Soriano, A.

2026-06-22 genetic and genomic medicine 10.64898/2026.06.11.26355467 medRxiv

Top 0.9%

0.4%

Background. Tertiary clinical genomics reports condense layered molecular findings into documents that treating oncologists must read, translate, and act upon; manual summarisation of these reports is time-consuming and variable. Tools that assist summarisation and translation into local languages are emerging, yet the field lacks an agreed methodology for evaluating such tools before any downstream clinical use. The appropriate first endpoint is fidelity of the generated summary to its source report, assessed by qualified human raters under blinded scoring, not downstream variant classification. Methods. QNOMX-VHIR-CPSP-001 Phase 1 is a single-site, non-interventional clinical performance study conducted at Vall d'Hebron Institut de Recerca (VHIR) under ISO 20916:2019 as a Clinical Performance Study Protocol. De-identified tertiary cancer genomics reports from pediatric oncology cases are summarised by the AI-assisted summarisation system under evaluation and, in parallel, by the standard manual workflow. Qualified raters score both summary types against the source genomics report using the Quality Summary Index (QSI), a six-dimension, five-point rubric adapted from the Provider Documentation Summarization Quality Instrument, under a blinded, counterbalanced, two-period crossover with a minimum fourteen-day washout. Two co-primary composite endpoints, content and presentation, are analysed for non-inferiority under a Bayesian hierarchical model, with a frequentist linear mixed model as the convergence check. Inter-rater reliability is reported as Krippendorff's α; a Monte-Carlo power analysis of the fixed clustered design is pre-specified. Discussion. The design isolates summarisation quality from clinical decision-making by scoring both summary types against the same source report under blinding, counterbalancing, and a fourteen-day washout. Conclusion. The QSI rubric, the counterbalanced crossover, and the pre-specified Bayesian primary with frequentist convergence check define a replicable protocol for early-stage evaluation of AI-assisted summarisation in tertiary genomics reporting; observed variance components will inform sample-size determination for Phase 2.

10

Client-server interfaces enable efficient agent-driven variant calling

Yu, X.; Zheng, Z.; Chen, L.; Qin, Z.; Guo, X.; He, M.; Luo, R.

2026-06-28 bioinformatics 10.64898/2026.06.25.734665 medRxiv

Top 0.9%

0.4%

Background: Large language model (LLM) agents increasingly automate bioinformatics analyses, but most existing bioinformatics tools were built for standalone use by human experts. An agent driving such a tool must reason about its installation, configuration, and execution from documentation written for humans, spending many turns, tokens, and tool calls per result. How a method is exposed to an agent can therefore matter as much as the method itself. Designing agentic interfaces for these tools can reduce this overhead and improve the reliability of agent-driven analyses. Findings: To test this design, we re-architected Clair3, a widely used deep-learning-based long-read variant caller, into a client-server system, Clair3-Connect. The client performs all genomics-related processing and holds the identifiable data. The server runs only neural-network inference, and the client sends only feature tensors to the server, while sample identifiers and genomic context remain on the client. The client exposes schema-defined agent-facing tools that an agent invokes through single structured calls. On an APOE diplotyping task, all 60 agent runs were correct. The agentic tools used 12K tokens in 3 turns, 6.8 to 14 times fewer tokens than the shell-driven baselines (81K-163K tokens), at about a quarter of the wall-clock time and far more stably (4% versus 35% token usage variation). Dropping the pileup and phasing stages to keep the client light left SNP F1 within 0.1-0.3 points of standard Clair3 by 50x coverage, while mutual TLS and AES-256-GCM encryption added 7.2% to end-to-end runtime. Conclusions: Recasting an established algorithm as developer-built, agentic tools behind a secure client-server boundary makes it more efficient, reliable, and easier to deploy for an LLM agent than a third-party wrapper, which cannot recover the defaults and conventions only its developers know. Agentic interfaces should be a first-class deliverable of bioinformatics tool development.

11

LocusBlend: Flexible multi-index regional visualization of genomic association signals

Yang, C.; Cook, N.; Zeng, Y.; Fu, T.; Budde, J.; Cruchaga, C.; Belloy, M. E.

2026-07-21 genetic and genomic medicine 10.64898/2026.07.15.26358129 medRxiv

Top 1.0%

0.3%

Summary: It has become standard practice to visualize regional signals from genome-wide association studies (GWAS) using LocusZoom plots. Similarly, GWAS signals are compared to regionally matched quantitative trait loci (QTLs), i.e., variant-to-gene regulation data, using LocusCompare plots to aid assessment of candidate trait-related genes. Despite broad usage, these tools annotate variants by linkage disequilibrium (LD) to a single lead or index variant. This single-index representation has limitations for visualizing complex loci that contain multiple independent signals. We present LocusBlend, an interactive web application for multi-index, LD-blended visualization of genomic loci. LocusBlend supports one or two genomic association summary-statistic datasets and one to three index variants, multi-index LocusZoom color-blended plots, and matching LocusCompare visualizations. Applications to Alzheimer's disease GWAS and QTL signals illustrate that LocusBlend enables visualization and separation of independent signals despite shared LD and high genomic complexity. Overall, LocusBlend is aimed at helping researchers handle the continuously expanding complexity of human genomics findings. Availability and Implementation: LocusBlend is freely available at https://locusblend.wustl.edu. Publication-ready plots are generated in 1 min. Source code, documentation, example datasets, input templates, and reproducibility instructions are available at https://github.com/BelloyLab/LocusBlend. LocusBlend is implemented in Python using Streamlit, Plotly, and PLINK. Supplementary Information: Supplementary data are available online.

12

Testing Reversibility of Endosymbiotic Gene Transfer between Chloroplast and Nucleus

Su, D.; Chen, S.-A.; Hammer, P.; Chacko, E.; Beilinson, V.; Kinev, A.; Onishi, M.

2026-07-10 cell biology 10.64898/2026.07.03.736199 medRxiv

Top 1%

0.3%

Most proteins targeted to the organelles of endosymbiotic origin are encoded in the nuclear genome, placing them under the regulatory dominance of the nucleus. For photosynthetic eukaryotes, nuclear-encoded chloroplast proteins arise via two routes: First, genes of cyanobacterial origin were relocated to the nucleus through endosymbiotic gene transfer (EGT). Second, proteins of eukaryotic origin emerged to support chloroplast function and structure. These proteins are reimported into the chloroplast via an import machinery. Reversing the transfer of such genes from the nucleus to the chloroplast genome may offer insights into chloroplast regulation and evolution. In this study, we established a highly efficient and accessible electroporation protocol for chloroplast transformation in the green alga Chlamydomonas reinhardtii, and used it to reverse-transfer two nuclear-encoded genes encoding proteins arising via the two routes described above: the cyanobacteria-derived chloroplast division protein FtsZ1 and the Rubisco-linker EPYC1 of eukaryotic origin. Regardless of origin, both chloroplast-encoded FtsZ1 and EPYC1 showed proper localization and functionality comparable to their nuclear-encoded counterparts. Together, our study provides a robust protocol for chloroplast transformation, a platform for investigating the evolutionary drivers of EGT, and a foundation for advancing chloroplast bioengineering. Significance Statement: Endosymbiotic gene transfer has resulted in the mass migration of genes from the chloroplast genome to the nuclear genome. Reversing the gene transfer could reveal the evolutionary significance of genome partitioning. Using the green alga Chlamydomonas reinhardtii, this study developed an efficient, electroporation-based protocol for chloroplast transformation. Relocating the genes encoding two chloroplast-targeted proteins, FtsZ1 and EPYC1, to the chloroplast genome showed that the proteins maintained normal localization and function. The established transformation protocol facilitates systematic testing of reverse gene transfer to elucidate the potential evolutionary advantages of genome partitioning and opens new avenues for chloroplast bioengineering.

13

Can a Tissue-derived Progression Signature Accurately Predict Colorectal Cancer Stage Transitions in Blood?

Sarkar, P.; Sarkar, P.

2026-06-29 bioinformatics 10.64898/2026.06.23.734006 medRxiv

Top 1%

0.3%

Colorectal cancer (CRC) is challenging to track because its molecular changes become highly complex as the disease progresses, which makes robust biomarker discovery difficult. In this study, we developed a machine learning framework that integrates monotonic progression analysis with the StepMiner approach. We conducted external validation to identify reproducible, consistent transcriptomic biomarkers associated with CRC progression. Gene expression datasets from the publicly available GEO repository were analyzed across four disease states: normal colon, adenoma, primary colorectal cancer, and metastasis. First, we identified genes with monotonic expression, then used the StepMiner approach to identify genes that act as switches between stages. A balanced 74-gene signature was used for machine-learning classification with a Random Forest. External validation showed strong performance in tissue-based datasets. However, the tissue-derived signature performed poorly on plasma- and blood-based datasets, highlighting biological differences between transcriptomic profiles. Cross-filtering between tissue-derived genes and blood expression datasets resulted in the selection of a 62-gene blood-compatible signature. Leakage-free retraining on GSE164191 achieved a mean AUC of 0.868 with balanced precision. Functional enrichment analysis showed that these genes are highly active in cancer growth. Specifically, the genes CBX3, S100A11, PDK4, NCOR1, and SOX4 demonstrated stable and reliable performance across the validation folds. Overall, our study presents a progression-aware transcriptomic framework for CRC biomarker discovery and demonstrates the importance of external validation. Additionally, we evaluate whether tissue-derived signatures can predict blood profiles. The proposed approach may help the future development of tissue-based diagnostics and minimally invasive liquid-biopsy strategies for CRC. To ensure reproducibility, our proposed workflow was automated as a Nextflow pipeline. The tissue-derived model was deployed as an application utilizing Angular, ASP.NET Core, and Plumber (R).
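
The two gene-selection ideas mentioned in this abstract, a monotonic trend across ordered stages and a step-like "switch" between stages, can be sketched in a few lines. This is a simplified, StepMiner-style one-step fit by least squares, not the authors' pipeline or the published StepMiner statistics; the function names and the expression values are made up for illustration.

```python
def is_monotonic(values):
    """True if values never decrease or never increase across ordered stages."""
    pairs = list(zip(values, values[1:]))
    return all(x <= y for x, y in pairs) or all(x >= y for x, y in pairs)


def best_step(values):
    """Fit a one-step function by least squares (simplified sketch).

    Tries every split point; values[:split] share one mean and
    values[split:] share another. Returns (split, sum of squared errors).
    """
    best = None
    for split in range(1, len(values)):
        left, right = values[:split], values[split:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((v - mean_left) ** 2 for v in left)
               + sum((v - mean_right) ** 2 for v in right))
        if best is None or sse < best[1]:
            best = (split, sse)
    return best


# Hypothetical mean expression of one gene in:
# normal colon, adenoma, primary CRC, metastasis.
stage_means = [2.0, 2.5, 6.0, 6.5]
split, sse = best_step(stage_means)
print(is_monotonic(stage_means), split)  # True 2
```

Here the best split falls between adenoma and primary cancer, which is the kind of stage transition a switch-like gene would mark.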

14

CNSigs: An R Package for the Identification of Copy Number Mutational Signatures

Tallman, D.; Striker, S.; Byappanahalli, A. M.; Stockard, S.; Jenison, J.; Collier, K. A.; Blige, E.; Vater, M.; Stover, D. G.

2026-06-25 bioinformatics 10.64898/2026.06.21.733646 medRxiv

Top 1%

0.3%

Background: Copy number aberrations (CNAs) are gains and losses of large genomic segments present across most cancer types and are a hallmark of cancer genomic alterations. However, the processes underlying CNAs and characteristic patterns of CNAs are poorly understood. Bioinformatic advances have identified underlying single nucleotide variant (SNV) mutational signatures resulting from distinct mutational processes, yet development of algorithms able to uncover similar signatures for CNAs remains less advanced. Methods: Using segmented data files from DNA sequencing, six copy number features are extracted for signature determination: segment size, breakpoints per 10 megabases, copy number oscillation events, average changepoint size, average copy number, and breakpoints per chromosome arm, along with ploidy. Mixed model approaches and non-negative matrix factorization (NMF) are used to derive CNA signatures across cancer types. The full methodology was packaged in a robust, publicly available R package, termed CNSigs. Results: To verify the reproducibility of the signatures, we derived five signatures from two independent breast cancer datasets (total n>3000), demonstrating high accuracy (average cosine similarity = 0.89). Pan-cancer application of CNSigs in the TCGA dataset resulted in derivation of 13 pan-cancer signatures, which were significantly associated with disease-specific survival. Benchmarking CNSigs against two other CNA signature approaches within TCGA demonstrated non-overlapping signatures and favorable compute speed for CNSigs. We evaluated n=24 pairs of tumor and circulating tumor DNA (ctDNA) acquired at the same time and demonstrated that CNSigs are detectable and reproducible via ctDNA, with significant association of CNSig11 with metastatic triple-negative breast cancer progression-free survival for taxane but not platinum or capecitabine chemotherapy. CNSigs association with immunophenotype was evaluated in low-grade glioma (LGG), and CNSig3 was found to be highly prognostic for LGG yet complementary to immune features. Conclusions: The CNSigs R package allows researchers to easily analyze their own samples to derive copy number signatures and evaluate clinical associations. We demonstrate potential application in ctDNA and association with treatment response. The development of this package allows further investigation of underlying processes that may be responsible for these CNA fingerprints.
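
The reproducibility figure quoted in the Results (average cosine similarity) compares signature vectors derived from independent datasets. A minimal sketch of that comparison follows; it is written in Python rather than R, is not taken from the CNSigs package, and the two four-component signature vectors are invented for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical weights of one signature over four feature components,
# as derived from two independent cohorts.
sig_cohort_a = [0.5, 0.3, 0.2, 0.0]
sig_cohort_b = [0.4, 0.4, 0.1, 0.1]
similarity = cosine_similarity(sig_cohort_a, sig_cohort_b)
print(f"{similarity:.2f}")
```

A value near 1 means the two cohorts yield nearly the same feature profile for that signature; averaging this value over matched signature pairs gives a single reproducibility summary.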

15

EcoXAI: Autonomous Agentic Ecosystem for Explainable Artificial Intelligence and Biomedical Discovery

Matsumoto, N.; Choi, H.; Freda, P. J.; Hernandez, M. E.; Wang, Z. P.; Moore, J. H.

2026-07-13 bioinformatics 10.64898/2026.07.08.737358 medRxiv

Top 1%

0.3%

Motivation: As biomedical datasets and knowledge graphs continue to grow in size, complexity, and heterogeneity, navigating and extracting actionable insights from them presents a major bottleneck for researchers. There is a clear need for autonomous analytical solutions that can use recent advancements in agentic AI, such as agent harnessing and loop engineering, without introducing hallucination or workflow fragmentation. Researchers, regardless of technical expertise, need tools that streamline complex data analysis and deliver meaningful, actionable insights grounded in both data and established biomedical knowledge. EcoXAI addresses this by introducing a modular, customizable, containerized multi-agent system that structures analysis into explicit pipeline execution stages, lowering the computational barrier for clinical and translational researchers. Results: EcoXAI replaces monolithic AI text interfaces with an autonomous, execution-driven framework of specialized bioinformatics agents that deliver proactive, data-driven insights grounded in established biological knowledge. Unlike purely LLM-driven or less integrated AI solutions prone to hallucinations or biologically implausible outcomes, EcoXAI's multi-agent framework, which leverages modern agentic management and explicit knowledge graph integration, provides greater transparency and verifiability in its reasoning. In our drug repurposing use case for Alzheimer's disease, EcoXAI evaluated 103 drug candidates and identified 79 novel candidates whose predictive models exceeded a randomized baseline, including the CCR5 antagonist Maraviroc, whose generated hypothesis was subsequently supported by the literature. These results demonstrate the potential of knowledge graph-grounded AI agents to accelerate hypothesis-driven biomedical research. Availability and implementation: EcoXAI is available on GitHub at https://github.com/EpistasisLab/EcoXAI. Contact: jason.moore@csmc.edu

16

Towards a Unified Exact Solution of Rearrangement Small Parsimony for Natural Genomes

Bohnenkaemper, L.; Frolova, D.

2026-06-28 bioinformatics 10.64898/2026.06.23.733974 medRxiv

Top 1%

0.3%

Phylogenetic reconstruction is a fundamental problem in comparative genomics. As a theoretical problem in rearrangement studies, this has been modelled as the Small Parsimony Problem (SPP), in which ancestral genome structures must be determined so as to minimize the number of rearrangement events occurring throughout the phylogeny. This problem is of significant interest in microbial and cancer genomics, due to the prevalence and clinical importance of rearrangement events. Genome structures in this problem are expressed as sequences of markers, which are themselves oriented sequence features (such as genes) that abstract from non-structural variations. Recent research has focused on the problem under the natural genomes model, in which arbitrary variations in copy number of markers are allowed. Natural genomes are often studied under the DCJ-indel model, a model which has already been successfully applied to plasmid data. There also exist ILP solutions to a variant of the Small Parsimony Problem under the DCJ-indel model. However, these solutions are limited in their applicability, as they make some critical simplifications for tractability purposes: ancestral marker frequencies and precomputed putative ancestral adjacencies, with their predicted likelihoods, are assumed as input. This creates multiple problems from both a theoretical and practical perspective. Firstly, this simplification means that the full state space is not searched for a solution, but rather only the subset of genomes with the precomputed putative adjacencies, meaning an optimal solution to the exact SPP is not guaranteed. Secondly, marker frequencies are given externally, without any theoretical guarantees. Thirdly, the method used to precompute adjacencies relies on gene trees, which requires the use of genes as markers, when gene annotation is often unreliable, especially in regions with a lot of rearrangement.
Additionally, this restricts the applicability of the approach to sets of genomes that are both divergent and large enough to produce informative gene trees. This is, for example, rarely the case for plasmids, where nucleotide mutations are rarer than rearrangements and genomes are small. Hence, we revisit the problem to solve the exact SPP by introducing a cost to indel operations, which allows us to compute ranges of marker frequencies and derive theoretical results that reduce the solution space the ILP searches without sacrificing optimality. We show that this makes the problem tractable for small and recently related genomes, first on simulated genomes and then on a set of pathogenic plasmids that represent a realistic use case for the method.

17

pylimma: a faithful, AnnData-native Python port of R limma for differential expression analysis

Mulvey, J.

2026-07-10 bioinformatics 10.64898/2026.07.06.736732 medRxiv

Top 1%

0.3%

pylimma is a faithful Python port of limma, intended to bring one of the most widely used tools for differential expression analysis to the developing Python ecosystem for transcriptomics and proteomics. We validated pylimma against the existing R implementation through 227 function-level comparisons and across six real-world datasets spanning microarray, RNA-seq, proteomics and single-cell transcriptomics. pylimma reproduces limma's numerical output to a median agreement of 13 significant figures and calls identical sets of differentially expressed features and gene sets. This supports its use as a drop-in replacement for the R implementation.

18

Identification of a Novel Alternatively Spliced CRYBA1 Transcript in Unilateral Childhood Cataract Associated with Persistent Fetal Vasculature

Sankaranarayanan, R.; Vasavada, A. R.; Agrawal, D.; Vasavada, S. A.; Vasavada, V. A.

2026-07-13 genetic and genomic medicine 10.64898/2026.07.08.26357271 medRxiv

Top 1%

0.3%

Purpose: To identify transcript-level variants in crystallin genes in paediatric patients with unilateral cataracts. Methods: Anterior capsulorhexis samples (n=12) were collected from patients who underwent surgical management of congenital unilateral cataracts. Total RNA was isolated from lens epithelial cells, and complementary DNA (cDNA) was synthesized. Full-length RNA transcripts of 10 lens-specific crystallin genes were PCR-amplified and analysed via Sanger sequencing. Identified transcript variants were further validated using genomic DNA (gDNA) through Sanger sequencing. In addition, the full-length (~7,535 bp) CRYBA1 genomic region was sequenced using Oxford Nanopore Technology. Results: Aberrant low molecular weight (LMW) amplicons (~370 bp) of the CRYBA1 transcript were identified in three patients who presented with unilateral cataract. Of these 3 patients, 2 had persistent fetal vasculature (PFV) and 1 had a pre-existing posterior capsular defect (PPCD). Sanger sequencing revealed a precise loss of exons 2 to 4 in the CRYBA1 RNA transcript. No coding, splice-site, or large deletion variants were detected in the genomic DNA of the patients or their parents. In silico analysis predicted two possible truncated proteins arising from these alternatively spliced transcripts: one comprising the first 11 amino acids of the N-terminal region with a loss of all Greek key motifs, and another comprising 90 amino acids encoded by exons 5 and 6, initiated from an alternative start codon in exon 5, with loss of Greek key motifs 1 and 2. Conclusion: The precise skipping of exons 2 to 4, consistent with canonical splicing signals (5-prime-GU...AG-3-prime), in the absence of genomic alterations, suggests the presence of alternatively spliced (AS) CRYBA1 transcripts in human lenses. This is the first report documenting AS-CRYBA1 transcripts in association with childhood cataracts with PFV and PPCD.

19

Cardiologists' perspectives on sociocultural and structural factors shaping cardiovascular genetic testing

Ramey, H. M.; Gabriel, J.; Morales, A.; Romagnoli, K.; Williams, M. S.

2026-06-24 genetic and genomic medicine 10.64898/2026.06.22.26356233 medRxiv

Top 1%

0.3%

Introduction: Genetic testing is increasingly central to the diagnosis and management of cardiovascular genetic conditions. However, use and follow-through vary across patient populations. Examining clinician perspectives on sociocultural and structural factors influencing testing is important for understanding these differences and informing public health genomics research and implementation efforts. Methods: We conducted semi-structured interviews with 15 cardiologists from health systems across the United States who have integrated cardiogenetics in their practice. Interviews explored experiences diagnosing cardiovascular genetic conditions among patients from underrepresented backgrounds, as well as approaches to incorporating social and contextual information into care. Data were coded thematically and analyzed using a framework analysis guided by the Health Equity Implementation Framework and Social Determinants of Health domains. Results: Clinicians described multi-level factors shaping genetic testing practices, including patient-provider interactions, clinical workflows, health system infrastructure, and broader policy contexts. Key themes included challenges communicating complex genetic information across language and literacy differences; patient trust shaped by prior healthcare experiences; fragmented insurance coverage separating genetic testing from genetic counseling; and challenges interpreting variants of uncertain significance, particularly for populations underrepresented in genomic reference databases. Clinicians also described adaptive strategies, such as interdisciplinary collaboration, telehealth, and patient assistance programs, that supported testing in some settings but were often inconsistent or resource-dependent. Conclusion: Among cardiologists using genetic testing, system-level and sociocultural factors shape the feasibility and downstream use of cardiovascular genetic testing. 
Findings highlight considerations for public health-informed genomic infrastructure that accounts for social context, supports communication, and reduces reliance on individual clinician workarounds, with implications for clinical decision support and related public health genomics initiatives.

20

Benchmarking long-read variant sensitivity across ONT and PacBio platforms using known clinically reported variants in a cohort of critically ill newborns

Marvin, C. T.; Devaney, J. M.; Buckingham, K. J.; Noya, J.; Shively, K. M.; Jacques, C.; Galey, M.; Storz, S. H.; Goffena, J.; Berlyoung, A. S.; Patterson, K. E.; Shaffer, T.; Zakarian, C.; McGee, S. R.; Smith, J. D.; Lochovsky, L.; Gustafson, J. A.; Sommerland, O. M.; Anderson, K.; Love-Nichols, J.; Facio, F. M.; Robertson, A. V.; Rowell, W. J.; Lake, J. A.; Carroll, A.; Miller, D. E.; Wei, C. L.; McWalter, K.; Wenger, T. L.; University of Washington Center for Rare Disease Research; Johnson, B.; Bamshad, M. J.; Chong, J. X.

2026-07-10 genetic and genomic medicine 10.64898/2026.07.07.26357482 medRxiv

Top 1%

0.3%

Long-read whole genome sequencing (lrWGS) shows promise as an all-in-one test to detect clinically relevant variants and variants difficult to detect by current short-read whole genome sequencing (srWGS) pipelines. Comparisons between lrWGS and srWGS (or exome sequencing) pipelines will become commonplace as lrWGS is more widely adopted for clinical testing, particularly for individuals not diagnosed by srWGS. However, the sensitivity of lrWGS for detecting variants previously identified and prioritized by clinical srWGS has yet to be assessed. As part of the SeqFirst-neo study, a subset of critically ill newborns and their parents who underwent clinical srWGS also underwent lrWGS on the Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) platforms. In total, 134 families were sequenced across multiple technologies including 128 families with clinical srWGS who were sequenced on both lrWGS platforms. We compared the variants reported by clinical testing with the variants identified by lrWGS. Among the 128 families sequenced on all three platforms, 89 SNV/indels and 14 SV/CNVs clinically reported by the srWGS testing pipeline were evaluated. All variants assessed in probands were ultimately detected by both lrWGS platforms, although three events were not detected prior to application of an updated variant caller, highlighting the rapid evolution of lrWGS variant calling. Additionally, breakpoint coordinates and event sizes often differed substantially between calls from srWGS and events called in lrWGS data. Our work demonstrates that while most clinically reported variants from srWGS can be detected by lrWGS pipelines, challenges remain when attempting direct comparisons, particularly for SV/CNVs.